White Wine Analysis by R

This report explores a dataset of 4,898 white wines, described by 11 input variables that quantify the chemical properties of each wine and 1 output variable, the quality.

Univariate Plots Section

Dimension

## [1] 4898   13

Adding a new column “rating” to the dataframe

This column gives an overall rating of the wines, according to the following scale:
poor: quality <= 4
average: 5 <= quality <= 6
good: quality >= 7
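The report does not show how the column was built. A minimal sketch of one way to do it with `cut()` follows; the `quality` vector is a stand-in rebuilt from the counts in the summary output, not the original data frame.

```r
# Stand-in for the quality column: the counts per score are taken from the
# summary output in this report; the order of the values is arbitrary.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Bin the scores: (-Inf, 4] -> poor, (4, 6] -> average, (6, Inf] -> good.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("poor", "average", "good"),
              ordered_result = TRUE)

rating_counts <- table(rating)
print(rating_counts)
```

Because `ordered_result = TRUE`, the result is an ordered factor with poor < average < good, the same shape as the "rating" column in the structure output.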

Structure of the dataframe

## 'data.frame':    4898 obs. of  14 variables:
##  $ Count               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ rating              : Ord.factor w/ 3 levels "poor"<"average"<..: 2 2 2 2 2 2 2 2 2 2 ...

Summary

##      Count      fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality      rating    
##  Min.   : 8.00   3:  20   poor   : 183  
##  1st Qu.: 9.50   4: 163   average:3655  
##  Median :10.40   5:1457   good   :1060  
##  Mean   :10.51   6:2198                 
##  3rd Qu.:11.40   7: 880                 
##  Max.   :14.20   8: 175                 
##                  9:   5

Histogram: x = quality, x = rating

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
##    poor average    good 
##     183    3655    1060

Here “quality” is the output variable, rated on a scale from 0 (very bad) to 10 (excellent). The histogram shows that no white wine in the data has been rated below 3 or above 9.

Most of the wines have an “average” rating, which leaves the “poor” and “good” wines looking like outliers. This raises a few questions, such as:

How accurate is the data?
Were the wines selected randomly for testing, how many brands were involved, and how varied is the selection in general?
How old were the wines when they were tested?

According to the dataset documentation, each quality score is the median of at least 3 evaluations made by wine experts, so it’s not clear which factors were taken into consideration when deciding the quality. We’ll now look at each variable individually to get a better understanding of it.

Histogram: x = fixed.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The distribution of fixed.acidity can be seen as either approximately normal or slightly positively skewed, with mean = 6.855 and median = 6.800. Outliers removed.

Histogram: x = volatile.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

volatile.acidity is positively skewed. Wines with lower values of volatile.acidity are more common, as higher levels can lead to an unpleasant taste.

Histogram: x = citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

As per the documentation, citric acid is found in small quantities, which explains the low values.

Histogram: x = residual.sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual.sugar distribution is positively skewed, with a pronounced spike at about 1.5 g/dm^3. The axis limits have been adjusted to hide the outliers at higher values.
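The report does not show how the outliers were hidden. One common approach is to cut the plotted range at a high percentile while leaving the data itself untouched. The sketch below does this in base R on simulated values; the 99% cut-off, the variable names and the simulated distribution are all assumptions made for illustration.

```r
set.seed(42)
# Simulated right-skewed values standing in for residual.sugar (not real data).
residual_sugar <- rlnorm(4898, meanlog = 1.5, sdlog = 0.8)

# Plot only the values up to the 99th percentile; the original vector
# is not modified, so summaries still use every observation.
upper   <- quantile(residual_sugar, probs = 0.99)
trimmed <- residual_sugar[residual_sugar <= upper]

# plot = FALSE keeps the sketch free of graphics devices; set it to TRUE
# to draw the histogram.
h <- hist(trimmed, breaks = 40, plot = FALSE)

# Number of observations hidden from the plot.
n_hidden <- length(residual_sugar) - length(trimmed)
print(n_hidden)
```

The same idea applies to the other variables marked "Outliers removed".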

Histogram: x = chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Most wines contain only a small amount of sodium chloride. Outliers removed.

Histogram: x = free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The free sulfur dioxide content is centered at about 34 - 35 mg/dm^3 (median = 34.00, mean = 35.31). The plot has been re-scaled to hide the outliers at higher values.

Histogram: x = total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

total.sulfur.dioxide is the sum of the free and bound forms of SO2, and follows nearly the same pattern as free.sulfur.dioxide. Outliers removed.

Histogram: x = density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution of density is approximately normal, with median = 0.9937 and mean = 0.9940. Outliers removed.

Histogram: x = pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

From the plot above we can see that most of the wines have a pH between 3 and 3.5. The distribution is almost normal, with median = 3.180 and mean = 3.188.

Histogram: x = sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The plot looks bimodal, with peaks at 0.39 and 0.48.

Histogram: x = alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The distribution is positively skewed. The plot shows that most white wines contain about 9 - 12% alcohol.

Univariate Analysis

What is the structure of your dataset?

The white wine data contains 4898 observations of 14 variables. The 14th variable, “rating”, was added to group the “quality” scores into 3 categories: “poor”, “average” and “good”. As it turns out, most of the wines have an average rating, with a whopping 3655 wines in this category. This could mean that we are working on a sample of the actual data set, that the data is incomplete, or that the testing was done under certain constraints such as brand or location.
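To put the imbalance between the categories into numbers, the counts from the summary output can be converted to percentages. This is a small sketch using only the counts printed in this report.

```r
# Counts per rating, copied from the summary output of this report.
rating_counts <- c(poor = 183, average = 3655, good = 1060)

# Share of each category in percent, rounded to one decimal place.
rating_share <- round(100 * rating_counts / sum(rating_counts), 1)
print(rating_share)
```

Roughly three out of four wines fall into the “average” category, while fewer than 4% are rated “poor”.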

What is/are the main feature(s) of interest in your dataset?

One of my main interests is to find which variable, or combination of variables, most strongly affects the quality or rating of a wine. I am also interested in how the variables affect each other, for instance how the density of a wine is affected by its alcohol and sugar content.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

At this point, I believe features like acidity (fixed and volatile), residual sugar and alcohol content may have a crucial effect on the pH and quality of a wine.

Did you create any new variables from existing variables in the dataset?

Yes. The “rating” variable was added to the dataframe based on the “quality” scores.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Most of the plots either have a normal distribution or are positively skewed. The alcohol and residual.sugar distributions stand out because of their several spikes along the scale, which make it a bit difficult to relate them to the wine quality. Perhaps more analysis alongside other variables can help explain the pattern.

Apart from adding the new ordered variable “rating” to the dataframe, no adjustment has been made to the original data. While building the histograms, outliers were removed for some of the variables to make the plots more readable:

  1. fixed.acidity and density have a normal distribution with outliers removed.
  2. volatile.acidity, citric.acid and alcohol are positively skewed.
  3. residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide are all positively skewed with outliers removed.
  4. pH has a normal distribution.
  5. sulphates has a bimodal look to it.
  6. fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide have long-tailed outlier values.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.28918070
## volatile.acidity       -0.02269729       1.00000000 -0.14947181
## citric.acid             0.28918070      -0.14947181  1.00000000
## residual.sugar          0.08902070       0.06428606  0.09421162
## chlorides               0.02308564       0.07051157  0.11436445
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.09407722
## total.sulfur.dioxide    0.09106976       0.08926050  0.12113080
## density                 0.26533101       0.02711385  0.14950257
## pH                     -0.42585829      -0.03191537 -0.16374821
## sulphates              -0.01714299      -0.03572815  0.06233094
## alcohol                -0.12088112       0.06771794 -0.07572873
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
##                        sulphates     alcohol
## fixed.acidity        -0.01714299 -0.12088112
## volatile.acidity     -0.03572815  0.06771794
## citric.acid           0.06233094 -0.07572873
## residual.sugar       -0.02666437 -0.45063122
## chlorides             0.01676288 -0.36018871
## free.sulfur.dioxide   0.05921725 -0.25010394
## total.sulfur.dioxide  0.13456237 -0.44889210
## density               0.07449315 -0.78013762
## pH                    0.15595150  0.12143210
## sulphates             1.00000000 -0.01743277
## alcohol              -0.01743277  1.00000000

Correlation with quality

##              density            chlorides     volatile.acidity 
##         -0.307123313         -0.209934411         -0.194722969 
## total.sulfur.dioxide        fixed.acidity       residual.sugar 
##         -0.174737218         -0.113662831         -0.097576829 
##          citric.acid  free.sulfur.dioxide            sulphates 
##         -0.009209091          0.008158067          0.053677877 
##                   pH              alcohol 
##          0.099427246          0.435574715

The vector above lists the correlation of each variable with the quality of a wine, sorted from the most negative to the most positive coefficient. In absolute terms, free.sulfur.dioxide has the weakest correlation and alcohol the strongest. Apart from alcohol, we can see that density, chlorides and volatile.acidity are also correlated with the quality.
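Since the printed vector is sorted by signed value, the weakest correlations sit in the middle rather than at one end. Re-ordering by absolute value makes the ranking easier to read. The sketch below uses the coefficients from the output above, rounded to three decimals; the variable names `cor_with_quality` and `ranked` are chosen here for illustration.

```r
# Coefficients copied from the output above (rounded to three decimals).
cor_with_quality <- c(
  density = -0.307, chlorides = -0.210, volatile.acidity = -0.195,
  total.sulfur.dioxide = -0.175, fixed.acidity = -0.114,
  residual.sugar = -0.098, citric.acid = -0.009,
  free.sulfur.dioxide = 0.008, sulphates = 0.054, pH = 0.099,
  alcohol = 0.436
)

# Rank by magnitude instead of by signed value.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(ranked)
```

In this ordering alcohol comes first, density second, and free.sulfur.dioxide last.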

(Articles from http://www.brsquared.org/wine/Articles/SO2/SO2.htm and https://winemakermag.com/501-measuring-residual-sugar-techniques were consulted to get an initial understanding of the various components of wine.)

The following correlations stand out, with an absolute coefficient > 0.3:

  1. pH <– -0.4259 –> fixed.acidity: Higher acidity means a lower value on the pH scale.
  2. total.sulfur.dioxide <– -0.4489 –> alcohol: There is a loss of total sulfur dioxide during the alcoholic fermentation, hence the negative correlation.
  3. total.sulfur.dioxide <– 0.5299 –> density
  4. total.sulfur.dioxide <– 0.6155 –> free.sulfur.dioxide: Total sulfur dioxide is the combination of the free and bound forms of sulfur dioxide.
  5. total.sulfur.dioxide <– 0.4014 –> residual.sugar: According to the article, wines with residual sugar might be better protected with higher levels of sulfur dioxide.
  6. density <– -0.7801 –> alcohol
  7. quality <– 0.4356 –> alcohol: Does more alcohol content make better wine?
  8. quality <– -0.3071 –> density: As per the documentation, density depends on the sugar and alcohol content of the wine. Density is negatively correlated with alcohol, whereas alcohol is positively correlated with quality, which may explain why density is negatively correlated with quality.
  9. density <– 0.8390 –> residual.sugar: The pair with the strongest correlation.
  10. alcohol <– -0.3602 –> chlorides
  11. alcohol <– -0.4506 –> residual.sugar: During the alcoholic fermentation, part of the sugar is converted to alcohol and the rest remains as residual sugar, so a negative correlation is expected in this case.
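A list like the one above can also be pulled out of a correlation matrix programmatically. The following sketch does so for a five-variable excerpt of the matrix printed earlier, with values rounded to three decimals; it illustrates the filtering step only and is not necessarily how the list was produced.

```r
# Five-variable excerpt of the correlation matrix shown earlier in the report.
vars <- c("alcohol", "density", "residual.sugar", "pH", "fixed.acidity")
r <- matrix(c(
   1.000, -0.780, -0.451,  0.121, -0.121,
  -0.780,  1.000,  0.839, -0.094,  0.265,
  -0.451,  0.839,  1.000, -0.194,  0.089,
   0.121, -0.094, -0.194,  1.000, -0.426,
  -0.121,  0.265,  0.089, -0.426,  1.000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Look only at the upper triangle so each pair appears once,
# and keep the pairs with |r| > 0.3.
idx <- which(upper.tri(r) & abs(r) > 0.3, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest relationship first.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

On the full data the same filter could be applied to the output of `cor()` over all numeric columns.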

Scatterplot Matrices (on sample data)

quality vs. fixed.acidity

fixed.acidity doesn’t affect the quality much. (cor.coeff = -0.114)

quality vs. volatile.acidity

Much like fixed.acidity, volatile.acidity has little effect on quality. (cor.coeff = -0.195)

quality vs. citric.acid

citric.acid has no effect on the quality of a wine. (cor.coeff = -0.009)

quality vs. residual.sugar

residual.sugar has little to no effect on the quality. (cor.coeff = -0.098)

quality vs. chlorides

With decreasing amount of chlorides, there is a subtle improvement in the quality. (cor.coeff = -0.210)

quality vs. free.sulfur.dioxide

free.sulfur.dioxide has very little effect on the quality. (cor.coeff = 0.008)

quality vs. total.sulfur.dioxide

total.sulfur.dioxide is negatively correlated with the wine’s quality, even though free.sulfur.dioxide has a (very weak) positive correlation with it. (cor.coeff = -0.175)

This is because the bound form of sulfur dioxide, the other component of total.sulfur.dioxide, is negatively correlated with the quality. (cor.coeff = -0.218)
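The bound form is not a column of the dataset, so it has to be derived as total minus free. The sketch below shows that step on the first ten rows printed in the structure output; the name `bound.sulfur.dioxide` is an assumption made here, since the report does not say what the derived variable was called.

```r
# First ten rows of the two columns, copied from the str() output above.
free.sulfur.dioxide  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total.sulfur.dioxide <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Bound SO2 is what remains after subtracting the free form from the total.
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide
print(bound.sulfur.dioxide)
```

Computing the correlation with quality needs the full data set, for example by passing the derived column and the numeric quality scores to `cor()`.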

quality vs. density

As density increases, the perceived quality decreases. Further analysis may be needed to examine exactly how much effect density has on the quality of wine. (cor.coeff = -0.307)

quality vs. pH

A higher pH is very slightly associated with better quality. (cor.coeff = 0.099)

quality vs. sulphates

Quality of wine stays quite steady with the varying amount of sulphates. (cor.coeff = 0.054)

quality vs. alcohol

This is where we see the strongest association with the quality of wine: wines with a higher alcohol content tend to be rated higher. (cor.coeff = 0.436)

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = white_wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.582009   0.098008   5.938 3.08e-09 ***
## alcohol     0.313469   0.009258  33.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

Alcohol alone explains about 19% of the variance in wine quality (R-squared = 0.1897). Below is a more detailed breakdown of alcohol level by quality (density plot).
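Two details of the model output are worth checking by hand. First, for a model with a single predictor, R-squared is the square of the correlation coefficient. Second, the structure output lists quality as a factor with levels "3" to "9", and `as.numeric()` on a factor returns level codes (1 to 7) rather than the scores themselves. If quality was still a factor when the model was fitted, the response was shifted by 2; that leaves the slope and R-squared unchanged but lowers the intercept by 2. The sketch below illustrates both points with a tiny made-up factor and the coefficient from the correlation output.

```r
# A small made-up factor with the same levels as quality in this report.
quality <- factor(c(3, 6, 6, 9), levels = 3:9)

codes  <- as.numeric(quality)                # level codes: 1 4 4 7
scores <- as.numeric(as.character(quality))  # actual scores: 3 6 6 9
print(codes)
print(scores)

# With one predictor, R-squared equals the squared correlation coefficient.
r_alcohol_quality <- 0.435574715             # from the correlation output
r_squared <- r_alcohol_quality^2
print(round(r_squared, 4))                   # 0.1897, as in the lm() summary
```

Converting through `as.character()` before `as.numeric()` keeps the intercept on the original 3 to 9 scale.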

We’ll now examine a few plots between variables with strong correlation.

pH vs. fixed.acidity (cor.coeff = -0.426)

total.sulfur.dioxide vs. density (cor.coeff = 0.530)

total.sulfur.dioxide vs. free.sulfur.dioxide (cor.coeff = 0.616)

total.sulfur.dioxide vs. residual.sugar (cor.coeff = 0.401)

alcohol vs. density (cor.coeff = -0.780)

density vs. residual.sugar (cor.coeff = 0.839)

alcohol vs. chlorides (cor.coeff = -0.360)

alcohol vs. residual.sugar (cor.coeff = -0.451)

This relation makes sense: the more sugar is broken down, the more alcohol is produced, hence the negative correlation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

alcohol has the most impact on the wine quality, even though it explains only about 19% of the variance. It was also interesting to analyze the relationships between alcohol and density, alcohol and residual.sugar, and quality and pH, which suggest that better wines tend to have high alcohol content, low density and low residual sugar.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The relationships of quality with pH, fixed.acidity, volatile.acidity and citric.acid were interesting to come across. The positive correlation with pH and the negative correlations with fixed.acidity and volatile.acidity suggest that good white wines might have lower acidity, that is, higher pH values.

What was the strongest relationship you found?

density vs. residual.sugar, with a correlation coefficient of 0.839.

Multivariate Plots Section

Since alcohol plays one of the major roles in deciding the wine’s quality, we’ll now analyze what affects the alcohol level.

alcohol vs. pH by rating

Better quality wines tend to have a slightly higher pH and a noticeably higher alcohol content.

alcohol vs. citric.acid by rating

Not much correlation between alcohol and citric.acid.

alcohol vs. fixed.acidity by rating

Better wines tend to combine higher alcohol with lower fixed acidity.

alcohol vs. volatile.acidity by rating

Unlike fixed.acidity, volatile.acidity seems to have a positive relationship with alcohol. This is a strange phenomenon, since pH has a positive correlation with alcohol and acidity decreases as pH increases. However, more data is required to confirm this conclusion.

fixed.acidity vs. volatile.acidity by rating

With a correlation of only -0.023, more data is needed to tell whether they have any significant effect on each other.

alcohol vs. sulphates by rating

Not much correlation between alcohol and sulphates.

alcohol vs. total.sulfur.dioxide by rating

Lower total.sulfur.dioxide for better wine.

alcohol vs. residual.sugar by rating

Low residual.sugar goes along with higher alcohol content and thus better wine. But it appears that residual.sugar in itself might not have much effect on the wine quality.

alcohol vs. density by rating

As per the documentation, density depends on the alcohol level and, as established, alcohol has a high impact on the quality of wine. From the above plot, it is quite evident that higher alcohol leads to a dip in the density. It might be the case that alcohol has an impact on both the density and the quality, rather than density alone affecting the quality.
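One way to probe this idea without the raw data is a first-order partial correlation: the correlation between quality and density that remains once alcohol is held fixed. This technique is brought in here as an illustration and is not part of the original analysis. The inputs are the pairwise coefficients printed in the bivariate section.

```r
# Pairwise correlations copied from the outputs in the bivariate section.
r_qd <- -0.307123313  # quality vs. density
r_qa <-  0.435574715  # quality vs. alcohol
r_da <- -0.78013762   # density vs. alcohol

# First-order partial correlation of quality and density given alcohol:
# (r_qd - r_qa * r_da) / sqrt((1 - r_qa^2) * (1 - r_da^2))
partial_qd_a <- (r_qd - r_qa * r_da) / sqrt((1 - r_qa^2) * (1 - r_da^2))
print(round(partial_qd_a, 3))
```

The result is close to zero (roughly 0.06) and slightly positive, far from the raw coefficient of -0.307. That is consistent with the reading that most of the apparent effect of density on quality runs through alcohol.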

Linear model on the variables that strongly affect the quality of the wine

## 
## Calls:
## m1: lm(formula = I(as.numeric(quality)) ~ I(alcohol), data = white_wine_data)
## m2: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density, data = white_wine_data)
## m3: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density + 
##     chlorides, data = white_wine_data)
## m4: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density + 
##     chlorides + volatile.acidity, data = white_wine_data)
## m5: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density + 
##     chlorides + volatile.acidity + total.sulfur.dioxide, data = white_wine_data)
## m6: lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density + 
##     chlorides + volatile.acidity + total.sulfur.dioxide + fixed.acidity, 
##     data = white_wine_data)
## 
## ============================================================================================================
##                              m1            m2            m3            m4            m5            m6       
## ------------------------------------------------------------------------------------------------------------
##   (Intercept)               0.582***    -24.492***    -23.150***    -37.573***    -32.759***    -45.308***  
##                            (0.098)       (6.165)       (6.162)       (6.010)       (6.295)       (6.493)    
##   I(alcohol)                0.313***      0.360***      0.343***      0.389***      0.391***      0.407***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.015)       (0.015)    
##   density                                24.728***     23.671***     38.217***     33.251***     46.423***  
##                                          (6.079)       (6.074)       (5.926)       (6.234)       (6.458)    
##   chlorides                                            -2.382***     -1.300*       -1.370*       -1.383*    
##                                                        (0.558)       (0.542)       (0.543)       (0.540)    
##   volatile.acidity                                                   -2.043***     -2.070***     -2.108***  
##                                                                      (0.111)       (0.111)       (0.111)    
##   total.sulfur.dioxide                                                              0.001*        0.001*    
##                                                                                    (0.000)       (0.000)    
##   fixed.acidity                                                                                  -0.099***  
##                                                                                                  (0.014)    
## ------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.248         0.249         0.257     
##   adj. R-squared            0.190         0.192         0.195         0.247         0.248         0.256     
##   sigma                     0.797         0.796         0.795         0.768         0.768         0.764     
##   F                      1146.395       583.290       396.315       402.956       324.034       281.812     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5657.292     -5654.027     -5627.454     
##   Deviance               3112.257      3101.773      3090.247      2889.234      2885.385      2854.246     
##   AIC                   11684.782     11670.255     11654.021     11326.584     11322.054     11270.908     
##   BIC                   11704.272     11696.241     11686.504     11365.563     11367.530     11322.880     
##   N                      4898          4898          4898          4898          4898          4898         
## ============================================================================================================
## 
## Call:
## lm(formula = I(as.numeric(quality)) ~ I(alcohol) + density + 
##     chlorides + volatile.acidity + total.sulfur.dioxide + fixed.acidity, 
##     data = white_wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4220 -0.5009 -0.0317  0.4697  3.2309 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.531e+01  6.493e+00  -6.978 3.40e-12 ***
## I(alcohol)            4.067e-01  1.508e-02  26.963  < 2e-16 ***
## density               4.642e+01  6.458e+00   7.189 7.51e-13 ***
## chlorides            -1.383e+00  5.399e-01  -2.562   0.0104 *  
## volatile.acidity     -2.108e+00  1.107e-01 -19.040  < 2e-16 ***
## total.sulfur.dioxide  6.805e-04  3.058e-04   2.225   0.0261 *  
## fixed.acidity        -9.927e-02  1.359e-02  -7.305 3.23e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7639 on 4891 degrees of freedom
## Multiple R-squared:  0.2569, Adjusted R-squared:  0.256 
## F-statistic: 281.8 on 6 and 4891 DF,  p-value: < 2.2e-16
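To see which added variable contributes most, the R-squared and AIC rows of the comparison table can be differenced from one model to the next. The sketch below uses only the values printed in the table; the vector and column names are chosen here for illustration.

```r
# R-squared and AIC per model, copied from the comparison table above.
r_squared <- c(m1 = 0.190, m2 = 0.192, m3 = 0.195,
               m4 = 0.248, m5 = 0.249, m6 = 0.257)
aic <- c(m1 = 11684.782, m2 = 11670.255, m3 = 11654.021,
         m4 = 11326.584, m5 = 11322.054, m6 = 11270.908)

# Change when moving from each model to the next, larger one.
gain_r2  <- diff(r_squared)   # increase in R-squared
drop_aic <- -diff(aic)        # decrease in AIC (larger is better)

steps <- data.frame(
  added = c("density", "chlorides", "volatile.acidity",
            "total.sulfur.dioxide", "fixed.acidity"),
  gain_r2  = gain_r2,
  drop_aic = drop_aic
)
print(steps)
```

The largest single improvement comes with m4, the step that adds volatile.acidity; the other additions change R-squared by less than 0.01 each.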

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • Higher alcohol content is the factor most clearly associated with better quality of wine.

  • Unlike alcohol’s negative correlation with fixed.acidity and citric.acid, it has a positive relationship with the volatile.acidity.

  • Low total.sulfur.dioxide is evidently better for wine.

Were there any interesting or surprising interactions between features?

Level of alcohol affects the density and residual.sugar content. But density and residual.sugar don’t seem to have much effect on the quality by themselves.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. The fact that most of the wines in the data set are of average quality becomes a bottleneck in determining how quality varies with the other variables. The R-squared value is low: alcohol alone explains only about 19% of the variance in quality, and factoring in the other variables raises this only to about 26% (model m6).


Final Plots and Summary

Plot One

Description One

Alcohol has the strongest influence on the quality of wine. The mean and median in each quality boxplot align closely, even for the higher quality wines where we have less data.

Plot Two

Description Two

After the quality of wine, it’s crucial to highlight the relations between alcohol and residual.sugar, and between alcohol and density. As sugar is fermented, the alcohol content increases, which in turn affects the density. The plot not only shows how strongly correlated they are, but also suggests that the negative correlations of quality with density and with residual.sugar are primarily due to the alcohol level.

Plot Three

Description Three

With attributes like fixed.acidity, volatile.acidity and citric.acid in the wine, it was interesting to see how pH varies with the quality. After an initial rise in pH (mostly for average and good wines), good quality wines have an almost constant pH as the alcohol level increases.


Reflection

I chose the white wine data for my project. After going through the document, it was important to look for the factors that are influential to the wine taste and following are some of the things I encountered:

It might be interesting to analyze the red wine data separately or alongside the white wine data. Maybe I will be able to find how some of the chemical components behave differently in the two wines and how the quality is affected in turn.